ABSTRACT

XML transactions are used in many information systems to 
store data and interact with other systems. Abnormal trans- 
actions, the result of either an on-going cyber attack or the 
actions of a benign user, can potentially harm the inter- 
acting systems and therefore they are regarded as a threat. 
In this paper we address the problem of anomaly detection 
and localization in XML transactions using machine learn- 
ing techniques. We present a new XML anomaly detection 
framework, XML-AD. Within this framework, an automatic 
method for extracting features from XML transactions was 
developed as well as a practical method for transforming 
XML features into vectors of fixed dimensionality. With 
these two methods in place, the XML-AD framework makes
it possible to utilize general learning algorithms for anomaly 
detection. Central to the functioning of the framework is 
a novel multi-univariate anomaly detection algorithm, AD- 
IFA. The framework was evaluated on four XML transac- 
tions datasets, captured from real information systems, in 
which it achieved over 95% true positive detection rate with 
less than a 3% false positive rate. 

Categories and Subject Descriptors 

C.2.0 [General]: Security and protection; K.4.4 [XML Trans-
actions]: Security; I.2.6 [Learning]: Concept and parame-
ters Learning

General Terms 

Security, Algorithms, Design 

1. INTRODUCTION 

Today, many information systems communicate and inter- 
act through XML transactions. These transactions may fall 
victim to cyber attacks or even benign mistakes which can 
alter the structure and content of their interaction media, 



Permission to make digital or hard copies of all or part of this work for 
personal or classroom use is granted without fee provided that copies are 
not made or distributed for profit or commercial advantage and that copies 
bear this notice and the full citation on the first page. To copy otherwise, to 
republish, to post on servers or to redistribute to lists, requires prior specific 
permission and/or a fee. 
ASCAS28 ' 12 Orlando, Florida, USA 

Copyright 20XX ACM X-XXXXX-XX-X/XX/XX ...$15.00. 



Alon Scholar 

School of Computer Science 
The Academic College of Tel-Aviv-Yafo 
Tel Aviv, Israel 61083 

alonschc@mta.ac.il 



i.e., the XML documents. Regardless of whether the origin 
of these alterations is malicious or benign, the altered XMLs, 
especially those that adhere to the XSD schema, can poten- 
tially exploit any vulnerabilities of the interacting informa- 
tion systems. Since the alteration process generally renders
the XMLs anomalous with respect to the majority of the XML
transactions in the same domain, detecting anomalous XMLs
is an important means of increasing
the security of many information systems. Unfortunately, 
state-of-the-art, end-to-end security protocols for XML transac-
tions, i.e., XML encryption [15], XML signature [16], and
XML canonicalization [14], provide little protection against
such a threat. This is mostly because the alteration actions, 
which deform the XMLs, take place before such protective 
measures are applied at the endpoint's systems. It follows 
that in addition to the above-mentioned security protocols, 
XML documents should be subjected to an anomaly detection
system prior to being consumed by the endpoint information
systems. 

Extensible Markup Language (XML) is a framework that facil-
itates the definition of structured markup languages. Data 
in such languages is described by documents in which ev- 
ery datum is encapsulated by tags. XML files bound
to the same XSD schema definition can vary considerably:
two XMLs that adhere to the same XSD schema can
have very different attributes regarding both their content
and structure.

1.1 XML Anomalies 

Anomalies are data patterns which are either very rare or 
novel. In the scope of this paper, the anomalous patterns are 
related to both the structure and content of an XML docu- 
ment. Such patterns can be generated by actions whose
intent is either malicious (e.g., a cyber attack) or benign
(a user mistake or a technical error). Next, we describe the
two most prominent generators of anomalous patterns.

XML attacks 

Applications that interact through XML messages, such as 
various Web-services, are essentially vulnerable to a wide 
range of malicious attacks. These attacks exploit various 
vulnerabilities in the XML processing mechanism, such as 
the vulnerability of XML parsers or the weak points in in- 
put verification in the target server application. Among the 
prominent attacks of this type are input validation attacks 
[26]; probing [39]; malware infiltration; buffer overflow [39, 26];
XML parameter poisoning [39, 36]; CDATA field attacks [39, 36];
SQL injection [39, 36, 26]; cross-site scripting [26]; schema
poisoning [23]; denial of service (DoS); DDoS - XML bombardment;
DOM parser DoS attacks; XML Bomb [37]; and repetition attacks.

Another threat to modern information systems arises from 
data leakage. Among the causes of data leakage are Tro- 
jan attacks, SQL injection attacks, or simple human error. 
There are many ways that outgoing XML transactions can 
lead to data leaks in the system. The simplest way results 
from putting all the data, as it is, in one field that is not 
properly constrained by a regular expression. A simple vari-
ation of this scheme is to divide the data into several parts
and embed them in many different fields of the XML file.

Benign Anomalies 

Not all XML anomalies are a product of a cyber attack or 
a malicious action. There are many ways in which XML 
documents might become anomalous. User mistakes, appli- 
cation errors and communication errors are typical examples 
of how benign anomalies might be induced in XMLs. 

1.2 Problem Statement and Applicability 

The present work focuses on the problem of detecting and 
localizing anomalies in readable XML documents at com- 
puter endpoints. The algorithms presented in this pa-
per aim to detect anomalies that stem from either malicious 
or benign actions. In our opinion, it is important to detect 
both types of anomaly. The rationale behind this approach 
is the assumption that XML anomalies, regardless of their 
nature, have the potential to invoke unwanted effects in the 
information processing system. 

We would like to stress that the present work does not try to 
infer the nature of the detected anomalies, since this would
require elaborate forensic work and an understanding of the
anomalies' semantics. Consequently, we use XML-AD only as
an indicator of what could be a network attack carried by
XML documents. As such, XML-AD is applicable, for example,
to endpoint anomaly-based XML-Firewalls.

1.3 Anomaly Detection 

Anomaly detection [2] [3] [1] [21] [25] [29] is a process aimed
at discovering patterns in datasets that deviate from the be-
havior or the expected behavior of the majority of the data.
Anomaly detection can be found in a broad spectrum of ap- 
plications such as intrusion detection, cyber-security, fraud 
detection, financial systems, and military surveillance to 
name a few. Anomaly detection methods employ a wide 
range of techniques that are based on statistics, classifica- 
tion, clustering, nearest neighbor search, information theory 
and spectral analysis. 

In many domains, such as XML transaction-based
systems, there may in practice be an infinite number of
anomalous patterns, which are very rare and hard to
obtain. In such cases, the most conventional learning ap-
proach, i.e., supervised learning, is impractical since train-
ing a supervised classifier demands at least a single example
of each of the patterns that must be classified. Moreover,
in many real-life domains, normal-state examples are in-
herently easier to obtain than anomalous-state examples. In
such domains, researchers take the semi-supervised anomaly
detection approach (also known as one-class learning), in
which only the normal class is taught.
Semi-supervised anomaly detection is a suitable ap-
proach for training an anomaly detector for XML transactions
when one assumes that at the time of training, only normal
XML examples exist.

1.4 Similarity-Based Anomaly Detection 

In order to classify a new XML example as normal or 
as an anomaly, most anomaly detection algorithms consider 
the similarity of the new example to a group of XML docu- 
ments, labeled normal. The general idea is that if the new 
example is similar to the normal instances, it should be la- 
beled normal; otherwise, it should be labeled anomaly. Such 
a similarity function, denoted as $S(\cdot)$, takes two parame-
ters: a new instance, $x_{new}$, and a group of XML documents
labeled as normal, $X_{\{normal\}}$. One way to calculate such
a similarity score is by computing the distance from $x_{new}$
to the closest normal XML document, i.e.,
$S(x_{new}, X) = \inf_{x \in X_{\{normal\}}} d(x_{new}, x)$, where $d(\cdot)$
is a pairwise distance function. The distance function,
$d(x_{new}, x)$, computes a scalar that reflects the XMLs' mu-
tual similarity in the feature space. Finding such a distance
function, which is able to separate the normal documents
from the anomalous documents in the feature space, is a
fundamental challenge in XML anomaly detection [24].
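
The nearest-normal-document score described above can be illustrated with a minimal sketch. It assumes that each XML document has already been mapped to a fixed-length numeric vector and that $d(\cdot)$ is the Euclidean distance; both choices, and the names `euclidean` and `nearest_normal_distance`, are illustrative assumptions rather than part of the framework.

```python
import math

def euclidean(u, v):
    """Pairwise distance d(x_new, x) between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def nearest_normal_distance(x_new, normal_set, d=euclidean):
    """S(x_new, X): distance from x_new to the closest vector labeled normal."""
    return min(d(x_new, x) for x in normal_set)

# Toy feature vectors standing in for normal XML documents (hypothetical data).
normal_docs = [(1.0, 2.0), (1.5, 2.5), (0.5, 1.5)]
print(nearest_normal_distance((1.0, 2.0), normal_docs))    # coincides with a normal vector
print(nearest_normal_distance((10.0, 10.0), normal_docs))  # far from every normal vector
```

Because the finite training set makes the infimum a minimum, `min` suffices here. Note that a single scalar is returned, which is why this kind of score cannot indicate which dimension caused the deviation.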

Most similarity-based anomaly detection algorithms use 
a multivariate vector distance function, which is a plausible 
approach in many domains, especially those in which data 
is represented by low-dimensional vectors. However, multi- 
variate distance functions are inherently susceptible to the 
"curse of dimensionality" [EJ. Consequently, the functions 
become much less accurate as the number of dimensions
grows, since the distances between all pairs of points in the
dataset converge, rendering the concept of distance
meaningless. Another weakness of similarity-
based anomaly detection algorithms that use multivariate
vector distance functions is that they are unable to indicate
the specific dimension(s) of the new vector that give rise to the
anomalous pattern. As a result, the algorithms do not allow
localization of the source of the anomalous pattern.

To avoid the weaknesses of the abovementioned similarity- 
based anomaly detection algorithms, the approach taken in 
this study is to use multiple, univariate distance functions. 
This approach was previously used in several domains, in-
cluding intrusion detection [35, 38, 28, 7]; however, it has
not yet been applied to detect XML anomalies. We show that our
approach results in very accurate XML anomaly detection 
and makes it possible to localize the dimensions (i.e., the 
XML features), which incurred the anomalous patterns. 

Paper outline 

The rest of this paper is organized as follows. In Section 2 we
present related work and discuss the need for a new anomaly
detection framework for XML transactions. In Section 3
we present XML-AD, our anomaly detection framework. In
Section 4 we discuss the methods, classifiers, datasets and
performance metrics used for evaluating XML-AD. In Sec-
tion 5 we present several evaluation experiments and discuss
their results. In Section 6 we summarize the contributions
of this paper and discuss future directions.

2. RELATED WORK 
Figure 1: The conceptual architecture of XML-AD



Surprisingly, despite the risk posed by anomalies in XML documents,
there has been very little relevant research. In
the following paragraphs, we describe available methods for 
anomaly detection of XML forms. All these methods take 
the semi-supervised anomaly detection approach. 
Bruno et al. [9, 10] propose a method for detecting
frequently occurring relationships in datasets that correspond
to the normal behavior of the data. The detection method 
uses association rules and the relationships are represented 
as quasi-functional dependencies. Anomalies are discovered 
by querying either the original database or the previously 
mined association rules to indicate the presence of erroneous 
data or novel information that represents the outliers of fre- 
quent rules. The method is independent of the considered 
database and directly infers rules from the data. In [8], an 
incremental approach is used to extend the method in [9,
10] to handle dynamic databases where the anomalies must
be updated according to changes that the data undergoes. 

Premalatha and Natarajan [31] mine negative association
rules [40] which are used to describe those relationships be-
tween item sets that indicate the occurrence of some item 
sets by the absence of others. The chi-square test is used to 
identify independent attributes and the anomalies are iden- 
tified as a negative association rule whose confidence value 
is greater than a minimum confidence threshold. Unfortu- 
nately, domain knowledge of the data sets is incorporated 
into filter rules, a step that does not contribute to the de- 
tection process. 
Another approach uses probabilistic inference for clas-
sification and anomaly detection of structured documents,
which its authors test on XML documents. Specifically, they ex-
tract a feature vector from every XML document according
to the number of attributes each tag can have. The fea-
tures are learned and represented in a factorized form as a
product of pairwise joint probability distribution functions
according to a method introduced by Chow and Liu [11].
Anomaly is detected by applying an acceptance threshold 
to the probability values. The authors indicate that this 
threshold should be trained and adapted for databases that 
are subject to quick changes. 
Raz et al. 
Specifically, invariants such as value intervals and arithmetic
expressions are extracted and used as proxies to detect anoma-
lies. The detection method is demonstrated for semantic
anomalies, i.e., values that are syntactically correct but
unreasonable. Two types of invariants are ex-
tracted, namely, the mean statistics and invariants that are
produced by an adjusted version of a software tool for detecting
invariants in computer programs.



3. XML-AD: AN AUTOMATED, CONTEXT- 
LEARNING OF XML DOCUMENTS 

We propose a new framework, XML-AD, for training a 
classifier for detecting and localizing XML anomalies. The 
framework comprises three stages: feature extraction, dataset
generation and machine-learning model training. The in-
put for the training process is a corpus of training XML 
transactions and a single XSD file, which defines the trans- 
actions at hand. 

In the first stage, the raw transaction features are extracted. 
In order to do so, the XSD is parsed and the meta data it 
contains, i.e., XML elements definition and constraints, is 
extracted. Next, the entire corpus is put through a feature 
extraction procedure. The meta-data extracted from the 
XSD is used to select the suitable feature extraction meth- 
ods for each of the available XML elements. The second 
stage is dataset generation, in which the transaction features 
are aggregated and arranged in tuples, each containing mul-
tiple derived (aggregated) features of a single transaction.
These tuples are then added to a dataset (the train-set). In
the third and final stage, the anomaly detector is trained by
applying the ADIFA algorithm (Section 3.3) to the train-
set generated in the previous stage. The above-mentioned
process is depicted in Figure 1.

3.1 Transactions Feature Extraction 

The feature extraction process starts with the acquisition 
of the definitions of the XML elements. This is achieved by 
parsing the XSD file. The XML elements' definitions are
then stored in a data structure that we denote as $V_{xsd}$.
Next, the transactions are processed and their features are
extracted and stored in the features matrix ($F_m$). The $F_m$
matrix contains a single row for each XML transaction.
Each row comprises multiple complex features, one for
each element definition stored in $V_{xsd}$. Since XML struc-
ture allows repetition of elements within the same document,
a complex feature may contain multiple occurrences of the
same element. Therefore, each complex feature contains
a list, $\{mv_1, mv_2, \ldots\}$, of 'measurement-vectors'
($mv$), which stores information regarding the occurrences of
the related element. Each term in the list
corresponds to a single element occurrence. Finally, each $mv$
contains $l$ scalar measurements, $\{m_1, m_2, \ldots, m_l\}$, that cor-
respond to $l$ attributes of each XML element occurrence.

In Figure 2 we show an XSD file that defines three vari-
ables: Payment Amount (PntAmt), PyValue and Name. The first vari-



<?xml version="1.0" encoding="UTF-8" ?>
<xsd:schema xmlns:xsd="...">
  <xsd:element name="TXLife">
    <xsd:element name="PntAmt">
      <xsd:complexType>
        <xsd:simpleContent>
          <xsd:extension base="xsd:double" />
        </xsd:simpleContent>
      </xsd:complexType>
    </xsd:element>
    <xsd:element name="PyValue">
      <xsd:complexType>
        <xsd:simpleContent>
          <xsd:extension base="xsd:enumeration" />
        </xsd:simpleContent>
      </xsd:complexType>
    </xsd:element>
    <xsd:element name="Name">
      <xsd:complexType>
        <xsd:simpleContent>
          <xsd:extension base="xsd:string" />
        </xsd:simpleContent>
      </xsd:complexType>
    </xsd:element>
  </xsd:element>
</xsd:schema>

Figure 2: An example of an XSD file 

able is defined as XSD:double, which means that a numerical
value may be assigned to it. PyValue is defined as an enu-
meration, so only a pre-defined (not visible in this example)
integer may be assigned to it. Last, the Name variable is
defined as XSD:String, indicating that any textual symbols
may be assigned to it. In Figure 3 we can see that $V_{xsd}$
contains one descriptive object for each element defined in
the XSD file shown in Figure 2. A descriptive object is a
simple container that carries the type of the XML element
(e.g., numeric, date, binary, etc.). In the case of an enumeration,
the descriptive object also contains the value ranges (as with
PyValue above). To avoid the complexity of handling many
XSD data types, we have found that it is enough to deal
with only a few abstract data types: Numerical, Enumera-
tion, String and Date. Figure 4 exemplifies the $F_m$ matrix
related to the above XSD example.



$V_{xsd}$ = {..., PntAmt, PyValue, Name, ...}

PntAmt: [Numerical]    PyValue: [Enumeration {0, 1, 3, 5}]    Name: [String]



Figure 3: A part of the XSD vector, $V_{xsd}$, pro-
duced by the XSD parser. Three objects are visi-
ble: PntAmt, PyValue and Name, of type Numerical,
Enumeration and String respectively.
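
As a rough illustration of how a structure like $V_{xsd}$ could be populated, the sketch below parses a small schema with Python's standard `xml.etree.ElementTree` and maps each element's base type to one of the four abstract types. The mapping table, the function name `build_v_xsd`, the flat sample schema, and the use of the standard W3C schema namespace are assumptions made for this example; the paper's own parser is not shown in the text, and enumeration value ranges are not captured here.

```python
import xml.etree.ElementTree as ET

XSD_NS = "{http://www.w3.org/2001/XMLSchema}"

# Hypothetical mapping from XSD base types to the four abstract types.
ABSTRACT_TYPES = {
    "double": "Numerical", "decimal": "Numerical", "integer": "Numerical",
    "enumeration": "Enumeration",
    "string": "String",
    "date": "Date", "dateTime": "Date",
}

def build_v_xsd(xsd_text):
    """Return {element name: abstract type} for elements with a typed extension."""
    root = ET.fromstring(xsd_text.strip())
    path = f"{XSD_NS}complexType/{XSD_NS}simpleContent/{XSD_NS}extension"
    v_xsd = {}
    for element in root.iter(XSD_NS + "element"):
        ext = element.find(path)  # only this element's own extension
        if ext is not None:
            base = ext.get("base", "").split(":")[-1]
            # Unknown base types fall back to String in this sketch.
            v_xsd[element.get("name")] = ABSTRACT_TYPES.get(base, "String")
    return v_xsd

SAMPLE_XSD = """
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="PntAmt"><xsd:complexType><xsd:simpleContent>
    <xsd:extension base="xsd:double"/>
  </xsd:simpleContent></xsd:complexType></xsd:element>
  <xsd:element name="PyValue"><xsd:complexType><xsd:simpleContent>
    <xsd:extension base="xsd:enumeration"/>
  </xsd:simpleContent></xsd:complexType></xsd:element>
  <xsd:element name="Name"><xsd:complexType><xsd:simpleContent>
    <xsd:extension base="xsd:string"/>
  </xsd:simpleContent></xsd:complexType></xsd:element>
</xsd:schema>
"""

print(build_v_xsd(SAMPLE_XSD))
```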

3.2 Dataset Compilation 

Traditionally, the large majority of machine-learning al- 
gorithms require that two conditions regarding their input 
datasets be met: (1) all instances must have the same num- 
ber of features, and (2) all features must be scalars (as op- 
posed to the complex features which have inner structure). 
While transaction instances in $F_m$ contain the same number
of complex features, they may contain a different number of
inner data items (e.g., measurement-vectors and measurements).
Thus, the second condition is not met.

To overcome this problem, $F_m$ should be flattened. No-
tice that flattening $F_m$ can yield only two situations. In

Where, for example, PntAmt has two occurrences in the XML, each
containing two scalar measurements: $mv_1$ = {value=1500, depth=2} and
$mv_2$ = {value=2500, depth=3}



Figure 4: The extracted features matrix, $F_m$. $m$ and
$n$ are the number of transaction instances and the
number of complex features respectively.

the first case, the number of features per instance is not
constant and consequently, the derived dataset will not be
rectangular. Therefore, the first condition mentioned above
is not met. In the second case, we keep the number of
features per instance constant by discarding some of the in-
stance's information. The benefit of choosing the second
alternative is that many generic machine-learning al-
gorithms can be used later to train an anomaly detection
model. When discarding data items, the loss of information
is inevitable, but it can be minimized by discarding only the
data that would probably not affect the anomaly detection
in any case (e.g., data with no extreme values).

The chosen method of flattening $F_m$ is to aggregate the mul-
tiple data items in the complex features into a fixed number
of scalars. The number of scalars depends on the number
of aggregation functions used. For numerical and time/date
complex features, we use the Maximum and Minimum aggre-
gation functions. Accordingly, only the maximal and mini-
mal values of the complex feature's data items are used
as simple features. For enumeration complex features we
use the Sum aggregation function. Since we sum the oc-
currences of each possible enumeration value, the number
of generated simple features depends on the number of el-
ements in the enumeration. String complex features are
treated slightly differently. In this case, we first create a
dictionary using our training set. Next, we compute word
TF-IDF (term frequency-inverse document frequency) val-
ues. Only the k most prominent words are chosen as simple
features. When creating the dataset, we sum up the oc-
currences of each of the k most prominent words in each
instance. In this way, each transaction instance contains k
simple features, each indicating the number of occurrences
of a chosen word in that instance.

Finally, at the end of the flattening process the dataset 
meets both of the required conditions mentioned above: in- 
stances of the same dimensionality and all the features are 
scalars. 
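
The flattening rules above can be sketched as follows. This is a minimal illustration under several assumptions that the text does not fix: words are ranked by their summed TF-IDF over the training texts, tokens are whitespace-separated and lower-cased, and every complex feature has at least one occurrence. The function names `top_k_words` and `flatten` are hypothetical.

```python
import math
from collections import Counter

def top_k_words(train_texts, k):
    """Pick the k words with the highest summed TF-IDF over the training texts."""
    docs = [text.lower().split() for text in train_texts]
    doc_freq = Counter(word for doc in docs for word in set(doc))
    scores = Counter()
    for doc in docs:
        for word, count in Counter(doc).items():
            tf = count / len(doc)
            idf = math.log(len(docs) / doc_freq[word])
            scores[word] += tf * idf
    return [word for word, _ in scores.most_common(k)]

def flatten(kind, values, enum_values=(), vocabulary=()):
    """Aggregate all occurrences of one complex feature into a fixed-length list."""
    if kind in ("Numerical", "Date"):
        return [max(values), min(values)]           # Maximum and Minimum
    if kind == "Enumeration":
        counts = Counter(values)
        return [counts[v] for v in enum_values]     # one count per enumeration value
    if kind == "String":
        counts = Counter(" ".join(values).lower().split())
        return [counts[w] for w in vocabulary]      # one count per chosen word
    raise ValueError(f"unknown abstract type: {kind}")

# One transaction: PntAmt occurs twice, PyValue three times, Name twice.
row = (flatten("Numerical", [1500.0, 2500.0])
       + flatten("Enumeration", [1, 3, 3], enum_values=(0, 1, 3, 5))
       + flatten("String", ["john smith", "john"], vocabulary=("john", "doe")))
print(row)
```

Whatever the number of occurrences per element, every row produced this way has the same length (here 2 + 4 + 2 scalars), which is what makes the dataset rectangular.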

3.3 The Anomaly Detection Model 

The new algorithm we propose for anomaly detection, AD-
IFA (Attribute Density Function Approximation), was in-
spired by the Parzen-Window density estimation method
[27]. Similar to Parzen-Window, ADIFA takes the approach
of non-parametric density estimation and assumes a prob-
abilistic generative model for the observed data. ADIFA
is a semi-supervised, meta-learning based algorithm.
The algorithm learns multiple univariate models, each of
which is responsible for approximating the density function
of a single related attribute and hence permits per-attribute
anomaly detection. When a test XML document is given,
the univariate density-function models compute a series of
per-attribute normality scores. These scores are then com-
bined via yet another univariate density-function model (i.e.,
the meta-classifier), which outputs the ADIFA prediction. 
We begin by discussing some preliminaries and then give a 
formal definition of ADIFA. 

Preliminaries 

Let $T = \{x_1, x_2, \ldots, x_m\}$ be a training sample containing $m$
instances, drawn i.i.d. from $X_{\{normal\}}$, the group of normal
instances. We assume that the instances in $T$ represent
the normal class well. For our purpose, it is convenient to
represent $T$ as a bag of $n$ independent features (attributes):

$$T = \{A_1, A_2, \ldots, A_n\}$$

Let $A_j = \{a_1, a_2, \ldots, a_m\}$ $(1 \le j \le n)$ be a finite series of values,
drawn i.i.d. from an unknown density function $d_j$ to be ap-
proximated. Let $x \in X$ be an instance to be classified. The
ADIFA tasks are: (1) approximate the density functions
$d_j$ $(1 \le j \le n)$, and then, using these approximate models,
(2) calculate the likelihood (or probability) that instance
$x$ was drawn from $X_{\{normal\}}$.

In order to safely use a set of univariate models for anomaly
detection in multivariate vectors, as proposed in this algo-
rithm, the next proposition should hold with high probabil-
ity: $\forall x_1, x_2 \in X$

$$D(x_1|T) > D(x_2|T) \Rightarrow \circledast(d_1(x_{1,1}|A_1), d_2(x_{1,2}|A_2), \ldots) > \circledast(d_1(x_{2,1}|A_1), d_2(x_{2,2}|A_2), \ldots)$$

where $D(\cdot)$ is a multivariate anomaly detection model; $d(\cdot)$
is a per-attribute univariate anomaly detection model; and
$\circledast$ is an aggregation function (e.g., arithmetic mean, ge-
ometric mean, and harmonic mean). $D(x_1|T)$ is the nor-
mality score of instance $x_1$ with respect to the train-set $T$,
whereas $d_i(x_{1,i}|A_i)$ is the normality score of the $i$th attribute
of instance $x_1$.

In other words, we assume that anomalous patterns can be 
effectively detected by aggregating attribute-wise normality 
scores. Our experimental results show that in many do- 
mains, especially in computer and network security, this as- 
sumption holds. Notice that in order to cover all anoma- 
lous patterns, one should also address uncommon situations 
in which examples have an anomalous combination of nor- 
mal values. This can be done, for example, by learning 
association-rules. 

3.4 Training Process 

First, a set of $n$ density-function models $D = \{d_1, d_2, \ldots, d_n\}$
is learned. A model $d_j : \mathbb{R} \to [0, 1]$ is responsible for ap-
proximating the density function of the corresponding at-
tribute, $Attr_j$ $(0 < j \le n)$. To calculate the attribute-wise
normality scores of a given test instance, $x^* = \langle x^*_1, \ldots, x^*_n \rangle$,
we use the Gaussian radial-basis function (RBF) with a nor-
malizer $b_j = (2\pi\sigma_j^2)^{-1/2}$:

$$\phi_j(x^*) = \frac{1}{m} \sum_{i=1}^{m} b_j \cdot \rho_j(\|a_{ij} - x^*_j\|)$$

where $a_{ij}$ is the $j$th attribute of train instance $i$, and $m$
is the cardinality of $A_j$. For the RBF function, we chose the
exponential-decay distance function:

$$\rho_j(\|a_{ij} - x^*_j\|) = e^{-\tau_j (a_{ij} - x^*_j)^2}$$

The coefficient $\tau_j$ controls the similarity decay speed, which
also controls the smoothness of the density function and
therefore its generalization power. The per-attribute den-
sity approximation model, $d(\cdot)$, is defined as follows:

$$d_j(x^*|A_j) = \phi_j(x^*) = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{\sqrt{2\pi\sigma_j^2}} \, e^{-\tau_j (a_{ij} - x^*_j)^2} \quad (1)$$
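
A direct transcription of Equation 1 for a single attribute might look like the sketch below. The text does not say how $\sigma_j$ and $\tau_j$ are chosen, so they are plain arguments here; the function name `attribute_density` is hypothetical. Setting $\tau_j = 1/(2\sigma_j^2)$ recovers the standard Gaussian Parzen-window estimate.

```python
import math

def attribute_density(x_star, train_values, sigma, tau):
    """Equation 1: mean of normalized exponential-decay kernels over A_j."""
    b = 1.0 / math.sqrt(2.0 * math.pi * sigma ** 2)  # normalizer b_j
    total = sum(math.exp(-tau * (a - x_star) ** 2) for a in train_values)
    return b * total / len(train_values)

# Toy training values of one attribute (hypothetical data).
A_j = [9.0, 10.0, 11.0]
print(attribute_density(10.0, A_j, sigma=1.0, tau=0.5))  # inside the training range
print(attribute_density(50.0, A_j, sigma=1.0, tau=0.5))  # far outside it
```

A larger `tau` makes each kernel narrower, so the estimate follows the training values more closely and generalizes less, which matches the role of $\tau_j$ described above.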



To compute the instance-wise normality score, $s(x)$, we
use an aggregate function $\circledast$ on the per-attribute normal-
ity scores as follows:

$$s(x) = \circledast_{j=1}^{n} \left[ \alpha_j \cdot d_j(x_j|A_j) \right] \quad (2)$$

where the weights, $\alpha_{1 \ldots n} \in [0, 1]$, reflect the complexity
of learning the corresponding approximation models. The
weights can be obtained by using methods such as
Information Gain [32] or by the techniques of [19], which are
based on the doubling dimension theory.

Lastly, the training is completed by computing the nor-
mality scores of the training instances. This is done by ap-
plying Equation 2 to each of the $m$ instances. Each training
instance is used exactly once as the test instance, while the
remaining $m - 1$ instances comprise the train-set $T$. Let $S$ be
the group of normality scores of the training instances, i.e.,
$S = \{s(x_1), s(x_2), \ldots, s(x_m)\}$.
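
The leave-one-out computation of the score group $S$ can be sketched as below. The sketch assumes the arithmetic mean as the aggregation function (Python 3.8+ `statistics.fmean`; `statistics.geometric_mean` or `statistics.harmonic_mean` could be passed instead), unit weights, and hand-picked $\sigma_j$ and $\tau_j$. All function names and the toy data are illustrative.

```python
import math
from statistics import fmean

def attribute_density(x_star, train_values, sigma, tau):
    """Per-attribute density (Equation 1)."""
    b = 1.0 / math.sqrt(2.0 * math.pi * sigma ** 2)
    total = sum(math.exp(-tau * (a - x_star) ** 2) for a in train_values)
    return b * total / len(train_values)

def normality_score(x, train, sigmas, taus, weights, aggregate=fmean):
    """Equation 2: aggregate the weighted per-attribute densities of instance x."""
    scores = []
    for j, x_j in enumerate(x):
        column = [row[j] for row in train]  # A_j: j-th attribute of every train row
        scores.append(weights[j] * attribute_density(x_j, column, sigmas[j], taus[j]))
    return aggregate(scores)

def leave_one_out_scores(train, sigmas, taus, weights, aggregate=fmean):
    """Score each training instance against the remaining m - 1 instances."""
    return [
        normality_score(x, train[:i] + train[i + 1:], sigmas, taus, weights, aggregate)
        for i, x in enumerate(train)
    ]

train = [[1.0, 10.0], [1.2, 10.5], [0.9, 9.8], [1.1, 10.1]]  # toy train-set
params = dict(sigmas=[1.0, 1.0], taus=[0.5, 0.5], weights=[1.0, 1.0])
S = leave_one_out_scores(train, **params)
print(S)
print(normality_score([50.0, -40.0], train, **params))  # an obvious outlier
```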

3.5 Classification with ADIFA 

The approach taken by the proposed algorithm for clas-
sifying a test instance is simple: calculate the likelihood
of obtaining a normality score such as that of
the test instance. Consequently, the classification is done in
two steps. First, the test instance's normality score, s(x*), 
is calculated using the technique presented in the previous 
section. Then, the likelihood of obtaining such a normality 
score is computed. 

Let $p(x)$ denote the density function of the normality
scores of the normal instances. Assuming that the normality
scores in $S$ are drawn i.i.d. from $p(x)$, the density function
$p(x)$ can be approximated by a single univariate model. This
is done using the same technique presented in Equation 1.

In order to calculate the likelihood of obtaining $s(x^*)$, the
algorithm computes how anomalous $s(x^*)$ is with respect to
$S$. This is done by approximating the density function $p(x)$
using an additional univariate density approximation, with the
parameters $s(x^*)$ and $S$:

$$d(s(x^*)|S) = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{\sqrt{2\pi\sigma_S^2}} \, e^{-\tau_S (s(x_i) - s(x^*))^2} \quad (3)$$



The value obtained by Equation 3 is the output of ADIFA.
If a classification is needed, this value can be thresh-
olded by a user-defined value $0 < C < 1$. In this case, ADIFA
predicts anomaly if $d(s(x^*)|S) < C$ and normal otherwise.
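
The second, meta-level step can be illustrated with the sketch below, which evaluates Equation 3 on a set of training scores and applies a threshold. It flags an instance as anomalous when the density falls below $C$, on the reasoning that a low density under the model of normal scores indicates an atypical score; this direction, the parameter values, and the function names are assumptions of the sketch.

```python
import math

def score_density(s_star, train_scores, sigma_s, tau_s):
    """Equation 3: density of the normal-score sample S evaluated at s(x*)."""
    b = 1.0 / math.sqrt(2.0 * math.pi * sigma_s ** 2)
    total = sum(math.exp(-tau_s * (s_i - s_star) ** 2) for s_i in train_scores)
    return b * total / len(train_scores)

def classify(s_star, train_scores, threshold, sigma_s, tau_s):
    """Return (label, density); low density under S is treated as anomalous."""
    density = score_density(s_star, train_scores, sigma_s, tau_s)
    label = "anomaly" if density < threshold else "normal"
    return label, density

# Hypothetical leave-one-out scores of the training instances.
S = [0.30, 0.31, 0.29, 0.32]
print(classify(0.30, S, threshold=0.05, sigma_s=1.0, tau_s=1250.0))
print(classify(0.00, S, threshold=0.05, sigma_s=1.0, tau_s=1250.0))
```

With $\sigma_S = 1$ the normalizer keeps the output below 1, so a threshold in $(0, 1)$ is meaningful; a much smaller $\sigma_S$ would push the density above 1 and would call for rescaling the threshold.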

3.6 Anomaly Localization Strategy 
Figure 5: ADIFA training process



Figure 6: ADIFA on-line classification process 



Identifying the location of an anomaly within the transac- 
tion document is a task that the system or service security 
officer has to deal with when a transaction is suspected of 
being anomalous. Locating the anomaly can be very impor-
tant, particularly when the anomaly source may indicate an
ongoing attack: once the location of an attack-related
anomaly is known, mitigation measures can be de-
ployed. It is also important in the post-attack period, since
the location information may provide the forensic expert 
with helpful facts regarding the attacker's sources, attack 
methods and propagation. Although such localization will 
not provide a complete operational meaning, such as the 
anomaly semantic, it is a step closer to this goal. 

Once an anomaly is detected, the anomaly localization is 
straightforward. Since it is a multi-univariate classifier, the 
ADIFA classifier has full knowledge regarding the normal- 
ity score of each of the test instance dimensions. When an 
anomalous transaction is detected, the ADIFA classifier lo- 
calizes the anomaly by identifying those features with the 
lowest normality scores. Additionally, the classifier can out- 
put a list of all features in order of their normality scores. 
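
Since the per-attribute normality scores are already available, localization reduces to a sort. The sketch below ranks features by ascending score; the feature names and score values are hypothetical, and `localize` is an illustrative name rather than part of the framework.

```python
def localize(per_attribute_scores, feature_names, top=3):
    """Rank features by ascending normality score; the lowest are the suspects."""
    ranked = sorted(zip(feature_names, per_attribute_scores), key=lambda pair: pair[1])
    return ranked[:top]

# Hypothetical flattened features of one suspicious transaction.
names = ["PntAmt_max", "PntAmt_min", "PyValue_3", "Name_john"]
scores = [0.002, 0.31, 0.28, 0.35]
print(localize(scores, names, top=2))
```

Passing `top=len(names)` yields the full list of features in order of their normality scores.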

4. METHODS 

In this Section we specify the methods for evaluating the 
algorithms proposed in this paper. First, in Section 4.1 
we describe the classifiers we used. Then in Section 4.2 
the performance metrics are detailed. Next, in Section 4.3 
we present the datasets used in our experiments. Lastly, in 
Section 4.4 we present our synthesizer application for seeding
anomalies in normal XML transactions. 

4.1 Classifiers 

We made use of six classifiers trained by four anomaly
detection (one-class) algorithms: OC-GDE, OC-PGA, OC-
SVM [34], and ADIFA.¹ We selected these classifiers, as
they represent the prominent branches of one-class algo-
rithms: density-based (OC-GDE, OC-PGA) and boundary-based
(OC-SVM). The first two algorithms are our own adapta-
tions of two well-known unsupervised algorithms to one-class
learning. Table 1 specifies the setup parameters of all six
classifiers used during the following experiments.

¹ The implementation of the mentioned algorithms can
be downloaded from: http://sourceforge.net/projects/xml-ad/files/AD.rar/download



One-Class Peer Group Analysis 

The One-Class Peer Group Analysis (OC-PGA) is an adap-
tation of the unsupervised Peer-Group-Analysis method (PGA),
proposed by Eskin et al. [17], to the one-class learning do-
main. The algorithm identifies anomalies as points in low-
density regions of the feature space. An anomaly score is
computed at a point $x$ as a function of the distances from
$x$ to its $k$ nearest neighbors. Although PGA is actually a
ranking technique applied to a clustering problem, we im-
plemented it as a one-class classifier. Given the training
sample $S$, a test point $x$ is classified as follows. For each
$x_i \in S$, we pre-compute the distance to $x_i$'s nearest neigh-
bor in $S$, given by $d_i = d(x_i, S \setminus \{x_i\})$. To classify $x$, the
distance to the nearest neighbor of $x$ in $S$, $d_x = d(x, S)$, is
computed. The test point $x$ is classified as an anomaly if
$d_x$ appears in percentile $\alpha$ or higher among the
$\{d_i\}$; otherwise, it is classified as normal.
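
A compact sketch of this one-class rule for $k = 1$ is given below. It assumes Euclidean distance (`math.dist`, Python 3.8+) and reads "percentile $\alpha$ or higher" as "among the largest $\alpha$ fraction of the training distances"; that reading, like the function names and the toy data, is an assumption of this example.

```python
import math

def nn_distance(x, points):
    """Distance from x to its nearest neighbor among points."""
    return min(math.dist(x, p) for p in points)

def fit_oc_pga(train):
    """Pre-compute d_i: the nearest-neighbor distance of x_i within S minus x_i."""
    return [nn_distance(x, train[:i] + train[i + 1:]) for i, x in enumerate(train)]

def is_anomaly(x, train, train_d, alpha=0.1):
    """Flag x when d_x lies in the top alpha fraction of the training distances."""
    d_x = nn_distance(x, train)
    rank = sum(d_i < d_x for d_i in train_d) / len(train_d)  # empirical percentile
    return rank >= 1.0 - alpha

train = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5)]
train_d = fit_oc_pga(train)
print(is_anomaly((0.5, 0.4), train, train_d))  # close to the training points
print(is_anomaly((5.0, 5.0), train, train_d))  # far from all of them
```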

One-Class Global Density Estimation 

The Global Density Estimation (GDE), proposed by [22],
is also an unsupervised density-estimation technique which
uses the nearest neighbors technique. Given a training
sample $S$ and a real value $r$, one computes the anomaly
score of a test point $x$ by comparing the number of training
points falling within the $r$-ball $B_r(x)$ about $x$ to the
average of $|B_r(x_i) \cap S|$ over all $x_i \in S$. We set $r$
to be twice the sample average of
$d_i = d(x_i, S \setminus \{x_i\})$ to ensure that the
average number of neighbors is at least one. In order to
adapt the GDE to the one-class domain (OC-GDE), we
used a heuristic function for thresholding anomaly scores.
We chose the following since it seemed to achieve a low
classification error on the data: $x$ is labeled as normal if
$\exp\left(-\left((n_r(x) - N_r)/\sigma_r\right)^2\right) > 1/2$,
where $n_r(x)$ is the number of $r$-neighbors of $x$ (points
at a distance no higher than $r$ from $x$) in $S$, $N_r$ is the
average number of $r$-neighbors over the training points, and
$\sigma_r$ is the sample standard deviation of the number of
$r$-neighbors.
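
A minimal Python sketch of this thresholding heuristic follows. It is an illustration under stated assumptions rather than the implementation used in the experiments: the class name `OneClassGDE` is hypothetical, distances are Euclidean, a training point is counted as its own $r$-neighbor (it lies in its own $r$-ball), and the score is taken to be the Gaussian-style $\exp(-((n_r(x) - N_r)/\sigma_r)^2)$ with a squared exponent.

```python
import math


def count_within(x, points, r):
    """Number of points at distance no higher than r from x."""
    return sum(1 for p in points if math.dist(x, p) <= r)


class OneClassGDE:
    """Illustrative OC-GDE (hypothetical helper class)."""

    def fit(self, S):
        self.S = [tuple(p) for p in S]
        n = len(self.S)
        # Leave-one-out nearest-neighbor distance d_i of each training point.
        loo = [
            min(math.dist(p, q) for j, q in enumerate(self.S) if j != i)
            for i, p in enumerate(self.S)
        ]
        self.r = 2.0 * sum(loo) / n  # twice the sample average of d_i
        # r-neighbor counts of the training points (each point counts itself).
        counts = [count_within(p, self.S, self.r) for p in self.S]
        self.N_r = sum(counts) / n
        var = sum((c - self.N_r) ** 2 for c in counts) / (n - 1)
        self.sigma_r = math.sqrt(var)
        return self

    def score(self, x):
        n_x = count_within(x, self.S, self.r)
        if self.sigma_r == 0.0:
            # Degenerate case: every training point has the same count.
            return 1.0 if n_x == self.N_r else 0.0
        z = (n_x - self.N_r) / self.sigma_r
        return math.exp(-z * z)  # assumed squared exponent

    def is_normal(self, x):
        return self.score(x) > 0.5


model = OneClassGDE().fit([(0,), (1,), (2,), (3,), (4,)])
print(model.is_normal((2.5,)), model.is_normal((100.0,)))
```

Under this reading, a point is labeled normal when its neighbor count is within roughly $0.83\,\sigma_r$ of the training average, since $e^{-z^2} > 1/2$ is equivalent to $|z| < \sqrt{\ln 2}$.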

4.2 Performance Metrics 

The main measure used to evaluate the classification
performance of XML-AD was the area under the receiver
operating characteristic (ROC) curve, which is a graphical
plot of the sensitivity vs. 1-specificity for a classifier
system as its discrimination threshold is varied. The ROC



Algorithm  Classifier  Acronym  Parameters
ADIFA      ADIFA-AM    AD-AM    * = Arithmetic Mean
ADIFA      ADIFA-HM    AD-HM    * = Harmonic Mean
ADIFA      ADIFA-GM    AD-GM    * = Geometric Mean
OC-GDE     OC-GDE      GDE      n/a
OC-PGA     OC-PGA      PGA      k = 1 (1-nearest neighbor), α = 0.1
OC-SVM     OC-SVM      1-SVM    Kernel = RBF (Gaussian), ν = 0.05



ARP-D 

The ARP abbreviation stands for "Address Resolution
Protocol" [30]. The ARP-D dataset contains actual ARP
spoofing attacks directed against the anonymized univer- 
sity's computer network. The dataset contains 9,039 in- 
stances and 24 attributes extracted from the link-layer frames. 
Each instance represents a single ARP packet that was sent 
through the network during the recording time. There were 
f 73 active computers on the local network, 27 of which were 
attacked. During the ARP attack, the attacker temporar- 
ily stole the IPv4 addresses of its victims and, as a result, 
their entire traffic was redirected to the attacker without the 
knowledge or consent of the victims. In order to experiment 
with different aggregation methods, two distinct datasets 
were produced, namely ARP-D1 and ARP-D2, which differ
mainly in the features' aggregation properties. The
datasets contain an unusually high proportion of
anomaly-labeled instances (attack), 75.2%, since the attacker
produced many ARP messages in order to maintain the attack.
The training instances were represented in XML format and
their numerical fields induced a Euclidean vector representation.

4.4 Malicious XML Transaction Synthesizer 

To properly evaluate XML-AD, it was necessary to use
XML transaction documents from real systems. However,
in many real-life systems, as was the case with the systems
from which our data was collected, anomalous XML
documents are extremely rare, to the point where entire
data collection instances can be safely presumed to be
normal. To overcome this problem, we implemented an XML
transaction synthesizer that embeds the desired number of
anomalies in normal XML documents. In order to produce
the XML anomalies, the synthesizer adds, deletes, and edits
XML elements with content, and can embed new texts of
varying length. In addition, the synthesizer can add known
attacks, such as malicious scripts and SQL injections. The
synthesizer, demonstrated in Figure 7, was used with the
Inventory and Insurance datasets to generate two-class
derived datasets.

As a result, all the datasets used in the proposed frame- 
work evaluation contained instances of two classes. We would 
like to point out that in order to train the anomaly detection 
classifiers, only instances of the class that represents the nor- 
mal XML documents were used. The other class instances 
were used later strictly for validation. 
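
The kind of manipulation the synthesizer performs can be sketched in a few lines of Python. This is only an illustration and not the synthesizer itself: the function `seed_anomaly`, the `PAYLOADS` list, and the element name `UnexpectedNode` are hypothetical, and the sample document follows the source file shown in Figure 7.

```python
import random
import xml.etree.ElementTree as ET

# Sample transaction fragment, following the source file of Figure 7.
SOURCE = """<Holding>
  <PntAmt>3500</PntAmt>
  <Name>Eduard N.</Name>
  <IssueDate>2008-03-11</IssueDate>
</Holding>"""

# Hypothetical payloads: SQL injection, script, and leaked free text.
PAYLOADS = ["' or 1=1 --", "<script>alert(1)</script>", "a secret msg"]


def seed_anomaly(xml_text, rng):
    """Apply one random edit, add, or delete operation to the document."""
    root = ET.fromstring(xml_text)
    children = list(root)
    op = rng.choice(["edit", "add", "delete"])
    if op == "edit":
        # Overwrite the text of a randomly chosen child element.
        rng.choice(children).text = rng.choice(PAYLOADS)
    elif op == "add":
        # Append an element that the normal documents never contain.
        ET.SubElement(root, "UnexpectedNode").text = rng.choice(PAYLOADS)
    else:
        # Remove a randomly chosen child element.
        root.remove(rng.choice(children))
    return op, ET.tostring(root, encoding="unicode")


op, anomalous = seed_anomaly(SOURCE, random.Random(0))
print(op)
print(anomalous)
```

Serializing through `ElementTree` escapes markup inside text payloads, so the output stays well-formed; a synthesizer that needs to emit malformed markup would have to operate on the raw text instead.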

5. EVALUATION 

The framework that we developed for evaluating ADIFA
focused on 1) whether our approach could effectively detect
anomalies in XML documents and 2) how well ADIFA could
perform in other domains. The evaluation framework we
developed is presented in Figure 1. The following sections
describe the experiments and results.

5.1 Anomaly Detection Framework 

In this section we present our evaluation of the XML-AD
framework in detecting anomalous XML transactions. In
our experiment we used three datasets: Inventory and
Insurance, which contain synthetic anomalies, and ARP-D,
which contains instances of a real cyber-attack and
presumably XML anomalies. The experiment consists of two parts.
First, we examine the anomaly detection performance of the



Table 1: Classifiers' setup parameters. Only the
non-default parameters are shown.

can also be represented equivalently by plotting the fraction
of true positives (TPR = true positive rate) vs. the fraction
of false positives (FPR = false positive rate). An ROC
analysis provides tools for selecting possibly optimal models
and discarding suboptimal ones independently from (and
prior to specifying) the cost context or the class
distribution. ROC analysis is related in a direct and natural
way to cost/benefit analysis in diagnostic decision-making.
Widely used for many decades in medicine, radiology,
psychology, and other areas, it has been introduced relatively
recently into machine learning and data mining. In order to
estimate the area under the ROC curve (AUC), a 5x2
cross-validation procedure was performed [13]. In each of the
cross-validation iterations, the dataset was randomly
partitioned into two disjoint instance subsets. In the first
fold, the first subset was utilized as the training set, while
the second subset was utilized as the testing set. In the
second fold, the roles of the two subsets were switched. This
process was repeated five times. The same cross-validation
folds were used for all algorithms in all experiments.
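
The evaluation protocol described above can be sketched as follows. This is a minimal illustration, not the evaluation code used in the experiments: the function names `auc` and `five_by_two_splits` and the fixed seed are assumptions. The AUC is computed in its rank-based form, i.e., the probability that a randomly chosen anomalous instance receives a higher anomaly score than a randomly chosen normal one, with ties counted as one half.

```python
import random


def auc(scores, labels):
    """Rank-based AUC; labels are 1 for anomalous and 0 for normal."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def five_by_two_splits(n, seed=0):
    """Yield (train_indices, test_indices) for 5 repetitions of 2 folds."""
    rng = random.Random(seed)  # a fixed seed gives every algorithm the same folds
    for _ in range(5):
        idx = list(range(n))
        rng.shuffle(idx)
        first, second = idx[: n // 2], idx[n // 2:]
        yield first, second  # fold 1: train on the first half
        yield second, first  # fold 2: the roles are switched


print(auc([0.9, 0.8, 0.1, 0.2], [1, 1, 0, 0]))
print(sum(1 for _ in five_by_two_splits(10)))
```

In the one-class setting, only the normal instances among the training indices would be passed to the classifier, while the test indices retain both classes for computing the AUC.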

The one-tailed paired t-test with a confidence level of 95%
was used to verify whether the differences in AUC between
the tested classifiers were statistically significant. In order
to conclude which classifier performed best over multiple
datasets, we followed the procedure proposed in [12]. We
first used the adjusted Friedman test in order to reject the
null hypothesis, followed by the Bonferroni-Dunn test to
examine whether a specific classifier produces significantly
better AUC results than the reference method.
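
The rank-based statistic behind this procedure can be sketched as below. The sketch assumes that "adjusted Friedman test" refers to the Friedman statistic with the Iman-Davenport correction; the function names are hypothetical, and the comparison against critical values (and the Bonferroni-Dunn step) is omitted. With $N$ datasets, $k$ classifiers, and average ranks $R_j$, the statistics are $\chi^2_F = \frac{12N}{k(k+1)}\left[\sum_j R_j^2 - \frac{k(k+1)^2}{4}\right]$ and $F_F = \frac{(N-1)\chi^2_F}{N(k-1) - \chi^2_F}$.

```python
def average_ranks(auc_rows):
    """auc_rows[i][j] is the AUC of classifier j on dataset i.

    Rank 1 goes to the highest AUC; tied classifiers share the average rank.
    """
    k = len(auc_rows[0])
    totals = [0.0] * k
    for row in auc_rows:
        order = sorted(range(k), key=lambda j: -row[j])
        pos = 0
        while pos < k:
            end = pos
            while end + 1 < k and row[order[end + 1]] == row[order[pos]]:
                end += 1
            shared = (pos + end) / 2 + 1  # mean of ranks pos+1 .. end+1
            for j in order[pos:end + 1]:
                totals[j] += shared
            pos = end + 1
    return [t / len(auc_rows) for t in totals]


def friedman_statistics(auc_rows):
    """Return (average ranks, Friedman chi-square, Iman-Davenport F)."""
    n, k = len(auc_rows), len(auc_rows[0])
    ranks = average_ranks(auc_rows)
    chi2 = 12.0 * n / (k * (k + 1)) * (
        sum(r * r for r in ranks) - k * (k + 1) ** 2 / 4.0
    )
    denom = n * (k - 1) - chi2
    # When all datasets rank the classifiers identically, denom is zero.
    f_stat = float("inf") if denom <= 0 else (n - 1) * chi2 / denom
    return ranks, chi2, f_stat


rows = [[0.9, 0.8, 0.7], [0.9, 0.7, 0.8], [0.9, 0.8, 0.7], [0.8, 0.9, 0.7]]
print(friedman_statistics(rows))
```

The $F_F$ value is compared against the F-distribution with $k-1$ and $(k-1)(N-1)$ degrees of freedom; only if the null hypothesis is rejected does the post-hoc comparison against the reference classifier proceed.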

4.3 Datasets 

To evaluate the XML-AD framework we used four distinct
datasets, whose details are discussed in the following
paragraphs.

XML-Transactions 

The XML-transactions collection comprises instances
of two domains, insurance and logistics. The first dataset
contains insurance transactions taken from a real insurance
information system that follows the ACORD standard [1].
This dataset contains 3,340 transactions, all labeled
as normal. Since the insurance XML transactions contain
many private data items, they were anonymized in a
pre-processing step to ensure the privacy of the insurees. The
second dataset, Inventory, is a collection of transactions
related to a logistic information system that mainly consists
of supply data. The dataset contains 4,000 transactions, all
labeled as normal. Both XML transactions datasets were put
through a process of feature extraction and dataset-flattening,
described in Sections 3.1 and 3.2. This process produced 1,021
and 285 features from the transactions of the Insurance and
Inventory datasets, respectively, regardless of the original
transaction size or the number of elements.



(a) Source XML file

<Holding>
  <PntAmt>3500</PntAmt>
  <Name>Eduard N.</Name>
  <IssueDate>2008-03-11</IssueDate>
</Holding>

(b) Value tampering

<Holding>
  <PntAmt>9982</PntAmt>
  <Name>Eduard N.</Name>
  <IssueDate>1999-11-03</IssueDate>
</Holding>

(c) Using text elements for leaking information

<Holding>
  <PntAmt>3500</PntAmt>
  <Name> a secret msg</Name>
  <IssueDate>2008-03-11</IssueDate>
</Holding>

(d) New node insertion

<Holding>
  <PntAmt>3500</PntAmt>
  <Name>Eduard N.</Name>
  <IssueDate>2008-03-11</IssueDate>
  <Malicious Node!!!>
</Holding>

(e) Malicious script

<Holding>
  <PntAmt>3500</PntAmt>
  <Name>Eduard N.</Name>
  <IssueDate>2008-03-11</IssueDate>
</Holding>
<SCRIPT ...> ... </SCRIPT>

(f) SQL injection

<Holding>
  <PntAmt>3500</PntAmt>
  <Name>' or 1=1 -'</Name>
  <IssueDate>2008-03-11</IssueDate>
</Holding>

Figure 7: Five manipulations made on the source XML file, shown in (a), using the XML Transaction Synthesizer



six abovementioned classifiers. Next, we examine the
detection performance as a function of the fraction of
anomalous elements in the XML transaction. Table 2 shows the
results for the first experiment and Figure 8 depicts the
classifiers' ROC curves.





Datasets   Anomalous      Classifiers
           XML elements   AD-AM  AD-GM  AD-HM  GDE    1-SVM  PGA
Insurance  1%             0.527  0.924  0.561  0.501  0.498  0.524
           5%             0.579  0.967  0.789  0.498  0.574  0.532
           10%            0.698  0.971  0.952  0.504  0.589  0.539
           Average        0.601  0.954  0.767  0.501  0.554  0.532
Inventory  1%             0.510  0.534  0.881  0.537  0.500  0.527
           5%             0.532  0.668  0.948  0.562  0.501  0.642
           10%            0.561  0.788  0.974  0.575  0.501  0.744
           Average        0.535  0.663  0.934  0.558  0.501  0.638
ARP-D1     n/a            0.873  0.871  0.799  0.634  0.643  0.929
ARP-D2     n/a            0.900  0.910  0.926  0.635  0.635  0.936
           Average        0.887  0.891  0.863  0.634  0.639  0.933
Total                     0.648  0.829  0.854  0.556  0.555  0.672

Table 2: Average AUC results for the XML transactions
datasets

The results show that ADIFA-HM achieves the highest
average AUC among all the tested classifiers. The next best
performance was achieved by ADIFA-GM (AD-GM). Both
are substantially better than the other four classifiers. The
GDE and PGA classifiers performed the worst, with an
average AUC close to 0.5, which is only slightly better than a
random classifier. By analyzing the results, it is clear that
the type of the aggregation function plays a crucial part in
ADIFA. In datasets where the dimensionality is relatively
low (ARP-D and Inventory), the geometric mean yields a
better classifier, whereas in a higher-dimensionality dataset,
such as Insurance, ADIFA performs better with the
harmonic-mean aggregation function. Furthermore, the
results show that all three multivariate classifiers, i.e., GDE,
PGA, and 1-SVM, performed very poorly on Insurance and
Inventory, the datasets which contain genuine XML
transactions. These results support our hypothesis regarding
the ineffectiveness of multivariate anomaly detection models
(models that use a multivariate distance function) for
detecting anomalies in XML transactions.
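
To see why the choice of aggregation function can matter, consider the following small Python sketch. It assumes, purely for illustration, that each univariate model emits a normality score in (0, 1] and that the scores are then aggregated into one value; the score vector and the function name `aggregate` are hypothetical.

```python
from statistics import fmean, geometric_mean, harmonic_mean

AGGREGATORS = {"AM": fmean, "GM": geometric_mean, "HM": harmonic_mean}


def aggregate(scores, how):
    """Combine per-feature scores with the chosen mean (AM, GM, or HM)."""
    return AGGREGATORS[how](scores)


# Hypothetical per-feature normality scores: three features look normal,
# one feature looks highly anomalous.
scores = [0.9, 0.9, 0.9, 0.05]
for name in ("AM", "GM", "HM"):
    print(name, round(aggregate(scores, name), 4))
```

For positive scores the three means always satisfy AM >= GM >= HM, so a single very low score pulls the harmonic mean down the most and the arithmetic mean the least. The aggregation function therefore controls how strongly one deviating feature influences the final decision.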

In the following experiment, we tested the classifiers'
responsiveness to XML anomalies. This experiment had two
goals: 1) to learn how responsive each classifier was to
different levels of transaction abnormalities and 2) to find
interesting trends, for example, determining which
classifier was the most effective for detecting weak anomalies
and which was preferable for stronger anomalies. Such a
scenario is indeed possible since the classifier improvement
rate, as a function of the percentage of anomalous elements,
can differ from one classifier to the other. If such a trend
were to be found, an ensemble of classifiers would probably
offer the best anomaly detection solution. To carry out this
experiment, our six classifiers were applied to ten variations
of the Insurance and Inventory datasets, each with a distinct
percentage of anomalous XML elements (1 to 10 percent). The
experimental results are shown in Figures 9 and 10,
respectively.

Figure 8: ROC plot for Inventory dataset with 10%
anomalous XML elements in abnormal transactions

Figure 9: AUC vs. anomalous XML elements
percentage (Insurance)

Figure 10: AUC vs. anomalous XML elements
percentage (Inventory)

5.2 ADIFA: A General Anomaly Detection 
Algorithm? 

In this section we examine whether the ADIFA algorithm
can perform anomaly detection in other domains as well as it
did within the XML transaction domain. In other words, we
wanted to know if the multi-univariate approach of ADIFA
would still work in other classification domains. To answer
this question we compared, in the following experiment, the
performance of ADIFA with several other anomaly detection
algorithms, as indicated in Table 1, on general datasets from
the UCI repository [18].



We selected 32 popular datasets from the widely used UCI
repository. The datasets vary across such dimensions as the
number of target classes, instances, input features, and type
(nominal, numeric). In order to have only two classes in the
datasets, we carried out a pre-process to select instances
from the two most prominent classes. The other instances
were filtered out. Similar to the previous experiment, only
instances of a single class (the first of the defined classes)
were used for training, while the instances of the second
class were used strictly for evaluation. The results are
displayed in Table 3; their significance is presented in Table 4.
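
The pre-processing step can be sketched as follows. The function name `to_one_class_task` is hypothetical, and the sketch assumes that "most prominent" means most frequent and that the more frequent of the two retained classes plays the role of the normal (training) class.

```python
from collections import Counter


def to_one_class_task(instances, labels):
    """Keep the two most frequent classes; train on the first, evaluate on the second."""
    (normal, _), (other, _) = Counter(labels).most_common(2)
    train = [x for x, y in zip(instances, labels) if y == normal]
    evaluation = [x for x, y in zip(instances, labels) if y == other]
    return normal, other, train, evaluation


X = [[1.0], [5.0], [1.1], [9.0], [5.2], [0.9]]
y = ["a", "b", "a", "c", "b", "a"]
print(to_one_class_task(X, y))
```

Instances of any remaining class (here the single instance of class "c") are filtered out, mirroring the description above.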

The results show that the three ADIFA variations
performed significantly better than all other tested classifiers
(OC-GDE, OC-PGA, and 1-SVM). The non-parametric
Bonferroni-Dunn test shows that there was no significant
difference between the ADIFA variations. This is mainly
because most UCI datasets are low-dimensional, where the
aggregation functions are similarly effective.

6. CONCLUSIONS AND FUTURE WORK 

This paper presented a new framework for detecting XML 
anomalies. Our experiments showed that the approach taken 
in XML-AD is very useful for detecting various types of 
anomalies, some of which originate from possible attacks on 
the structure and content of XML transaction documents. 

One of the foundational challenges we faced in this re- 
search was finding general and efficiently predictive XML 
features that could be extracted from any XML transac- 
tion. We devised an automatic feature extraction process 
in which both the XML content and structure features were 
addressed. 

A key feature of the proposed framework is its unique
method for transforming complex XML features into a
fixed-length feature vector (i.e., instance flattening). This
feature makes it possible to use general anomaly detection
algorithms, which are readily available. The price for this XML

Table 3: Average AUC results for the UCI datasets.
The AUC rank of each tested classifier is given in
parentheses.

        AD-AM  AD-GM  AD-HM  GDE  PGA
AD-GM   =
AD-HM   =      =
GDE     +      +      +
PGA     +      +      +
1-SVM   +      +      +      +    =

Table 4: The significance of the difference between
the classifiers using the AUC metric

feature transformation was relatively low, both in
computation time and in information loss, since the most
critical feature values (for deciding whether the instance is
abnormal) were preserved during the flattening process.

Most of the prominent existing XML anomaly detection
algorithms are based on association rules or multivariate
distance functions, both of which perform poorly in high
dimensions. We therefore proposed a new algorithm, ADIFA,
that is composed of multiple univariate models. Our evaluation
demonstrated ADIFA's superior performance over three
related algorithms (1-SVM, OC-PGA, and OC-GDE) both
in detecting anomalies in XML transactions and in other
domains.

Future work may include evolving our XML-AD framework
towards a transaction filtering system, which, in addition
to performing anomaly detection, could also prevent
system attacks, i.e., a machine-learning-based XML firewall.